ABSTRACT

Three examinations administered to medical students
were analyzed to determine differences among severities of judges'
assessments and among grading periods. The examinations included
essay, clinical, and oral forms of the tests. Twelve judges graded
the three essays for 32 examinees during a 4-day grading session,
which was divided into eight half-day grading periods. Eighteen
judges graded the performance of 217 examinees on the clinical
examination during a 2-day grading session that was divided into four
grading periods. Forty-six judges graded the performance of 270
examinees on the oral examination during a day and a half grading
session that was divided into three grading periods. An extension of
the Rasch model was used to analyze facets for examinees, items,
judges, and grading periods. This study focused only on judge
severities and differences among grading periods, however. The system
of links necessary to calibrate judge severities and grading periods
as separate facets was adequate because judges had 16 primary
protocols and some examinees in common. Data from each of the three
examinations were analyzed using FACETS, a computer program for Rasch
analysis of examinations with more than two facets. The FACETS
program estimates objective and conjointly additive calibrations,
standard errors, and fit statistics for each element of each facet in
the examination. Significant variation in judge severities and some
variation across grading periods were found on all three
examinations. However, the fit statistics confirm that most judges
are reasonably consistent in the application of their individual
level of severity. Four data tables and four graphs are included.

Severity of Grading Across Time Periods

Abstract

Three examinations which require judges to assess examinee performances
were analyzed to determine differences among judge severities and grading
periods. An extension of the Rasch model analyzed facets for examinees,
items, judges and grading periods. Significant variation in judge severities
and some variations across grading periods were found on all three
examinations.



Severity of Grading Across Time Periods 



Assessment of essay, oral, clinical or other examinee performances
usually requires the intervention of a judge. The expectation is that
examinee scores will be independent of the particular judges that grade the
performance and of the grading period. The reality, however, is usually closer
to what Thurstone (1927) observed: the discriminal process corresponding to a
given stimulus varies among individuals.

The validity and reliability of examinations which require judges have 
been questioned because of judge subjectivity and potential bias (Hurley, 
1982) related to judges. Attempts to improve uniformity among judges have 
included constructing structured items such as essays or oral protocols, 
standardizing grading criteria and administration procedures, and providing 
extensive judge training. But these efforts have served only to direct the 
attention of judges, not to control the subjectivity of their assessments. 

Inconsistency among judges has been studied extensively. Littlefield
et al. (1981) compared the ratings of various types of judges (i.e., faculty and
residents) and found significant differences in their assessments of similar
clerkships. A multiple choice examination was found to be more reliable than
the clinical ratings. Lunz and Stahl (1990) found inter- and intra-judge
inconsistency when pass/fail decisions about the same examinee performance
were made using different scoring criteria. Cason and Cason (1984) postulated
that the ratings received by a subject are a function of the subject's true
ability and the rater's characteristics, including the rater's resolving power,
sensitivity and stringency. A significant rater stringency effect and a
significant student ability effect were found. De Gruijter (1984)
demonstrated differences among judges using linear and nonlinear analysis
models. Lunz et al. (1989 and 1990) found that judges demonstrate discernible
levels of severity which affect examinee scores on oral and clinical
examinations. The results of these studies support the premise that judges
have unique standards which interact with the examination materials and
examinee performances, resulting in differing levels of severity.¹
Standardized grading criteria and administration procedures can define the
examination process, but cannot remove differences in judge severity.

Examinations which require judges are often graded during defined 
grading sessions with delimited grading periods. Assuming that examinee 
ability is randomly distributed across grading periods, and that examinee 
performances are randomly allocated among judges, it is possible that some 
examinee performances are more or less severely graded during some grading
periods. Thus the time of grading within the grading session in addition to 
the overall severity of the judge may influence the grade awarded. Braun 
(1988) found a sizeable shift in the average score of essay readers from day 1 
to day 2. 

Differences in judge severity and differences among grading periods for
three examinations, an essay examination, an oral examination and a clinical
examination, will be explored. These three examinations have several
attributes in common. Judges are needed to assess examinee performances and
the grading sessions have defined grading periods. The judges come to a
specific location to do the grading. The grading periods are contiguous.
The elapsed time between grading periods ranges between one hour and 12 hours
(overnight).

¹ Severity is the term used to describe the unique perception of a judge in
regard to the examination materials, the standards for competence and the
application of the rating criteria.

Data 

There is no overlap among the three examinations in regard to items, 
examinees or judges. If similar patterns of differences among judge 
severities and across grading periods are found for the three examinations 
this would imply that the patterns are not unique to a particular examination. 

The essay examination required the examinee to write three essays
(items) so that their skill in English composition could be evaluated. Twelve
judges graded the three essays for all 32 examinees during a four day grading
session which was divided into eight half-day grading periods. Essays were
graded on a nine point scale with 9 as excellent and 0 as unacceptable. A
total of 27 points represented a perfect score on all 3 essays. These data
have complete overlap, that is, all judges graded all essays for all examinees
sometime during the eight grading periods. There are no missing data.

The clinical examination required examinees to prepare 15 histology
slides (items) to detailed specifications. Eighteen judges graded
performances from 217 examinees during a 2 day grading session, divided into
four grading periods. It was impossible for all judges to grade all slides
for all examinees (15 x 217 = 3255 slides), so examinee performances were
allocated to judges. This introduced the opportunity for the severity of the
judge, as well as the grading period, to influence the grade assigned. A
rotation system enabled each judge to grade each of the 15 slides sometime
during the two day grading session and some combination of three judges to
grade subsets of an examination. This created the system of links necessary
to calibrate judge severities and grading periods as separate facets. Even
though there are missing data, all judges have all slides and some examinee
performances in common.

A perfect score was 75 points (15 slides x 5 points = 75 points). There
were three assessments for each slide. Quality of tissue cutting and
processing were each graded acceptable (1 point) or unacceptable (0 points).
Tissue staining was graded on a four point grading scale: unacceptable (0),
below average (1), average (2) and above average (3). This design did not
have complete overlap of judges, items, or examinees but did have a series of
links based on common examinee performances and common items which enabled a
complete calibration of all elements on one common scale (for a more complete
explanation of linking see Lunz, Wright and Linacre, 1990).

The oral examination required examinees to complete two twenty-minute
interviews, each with a different judge. These interviews were face-to-face
interactions between the examinee and the judge. Forty-six judges graded 270
examinees during a day and a half grading session, divided into three grading
periods. Twenty-seven structured protocols (items) were used for the
examination (16 primary and 11 make-up). Each protocol described the nature
of a case. The examinee then acquired additional information from the judge
until a diagnosis could be made or a treatment determined. A four point
grading scale was used on which 0 was unacceptable, 1 was below average, 2 was
satisfactory and 3 was excellent. A perfect score was 18 points (6 protocols
x 3 points = 18 points).

It was impossible for all judges to grade all examinees (2 interviews x
270 examinees = 540 interviews), so examinees were allocated to judges. This
introduced the opportunity for the severity of the judge and the grading
period to influence the score assigned. A rotation system in which examinees
were interviewed by two different judges using different subsets of protocols
enabled each judge to grade each of the 16 primary protocols during the first
two grading periods.

The third grading period was reserved for "make-up" examinations.
Examinees who were not determined to be clear passes or fails after two
interviews were examined a third time with a different judge and different
protocols during this third grading period. Eleven different protocols were
used during this period by a subset of the judges. The judges knew these
were "make-up" examinees.

The system of links necessary to calibrate judge severities and grading
periods as separate facets was adequate because judges have the 16 primary
protocols and some examinees in common. The overlap among judges, examinees
and protocols is least well defined for this examination and there are
missing data.
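
The linking requirement described for the clinical and oral designs can be checked mechanically before calibration. The following is a minimal sketch, not the procedure used by FACETS: it assumes each rating is recorded as a hypothetical (examinee, item, judge, period) tuple and tests only whether all elements fall into a single connected network. Connectedness in this sense is a necessary first check, not a guarantee that every facet is well estimated.

```python
def is_connected(observations):
    """Return True if all facet elements are linked into one network.

    observations: iterable of (examinee, item, judge, period) tuples,
    one per rating. Elements that share a rating are treated as linked.
    The tuple layout is an assumption made for this illustration.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for examinee, item, judge, period in observations:
        # Prefix each element with its facet so that judge 1 and item 1
        # are never confused with each other.
        nodes = [("examinee", examinee), ("item", item),
                 ("judge", judge), ("period", period)]
        for node in nodes[1:]:
            union(nodes[0], node)

    roots = {find(x) for x in list(parent)}
    return len(roots) <= 1
```

A design in which two groups of judges never share an examinee, item or grading period would return False here, signalling that their severities cannot be placed on one common scale.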

Methods 

It is usually assumed that the results of an examination generalize so
that sensible action can ensue. An examinee who passes an examination is
certified as having demonstrated an acceptable level of skill and knowledge,
regardless of the specific sample of essays, clinical slides or protocols and
regardless of the particular judge or grading period.

A measurement model designed to analyze an examination with multiple
facets must provide an analysis of each of the elements in each facet of the
examination. The particular elements within each facet must be calibrated in
a way that is independent of the local distributions of the elements in the
other facets. Thus, the positioning of examinee measures must function as
though independent of which judges, items or grading periods were encountered.
The two facet (dichotomous) Rasch model $\log(P_{ni}/(1-P_{ni})) = B_n - D_i$ (Rasch,
1960/1980) analyzes the two facets of item difficulty and examinee ability.
An examination with three or more facets will include facets for examinee
ability and item difficulty as well as any other facets, such as judge
severity or grading period, which may affect examinee scores.

An extension of the Rasch model to include all facets which are
pertinent to an examination was developed by Linacre (1989). The probability
of person n with ability $B_n$ achieving rating step x on item i with difficulty
$D_i$ from judge j with severity $C_j$ during grading period $T_t$ is modeled as
$\log(P_{nijtx}/P_{nijt(x-1)}) = B_n - D_i - C_j - T_t - F_x$ (see Appendix 1 for explanation).
This extended Rasch model constructs a variable, measured in log-odds units
(logits), that quantifies the elements within each facet so that quantitative
comparisons among and within the facets are possible. Each facet is
calibrated from the relevant observed scores and all but the examinee measure
facet are centered at a common origin.
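
To make the model concrete, the sketch below turns the adjacent-category log-odds into a probability for each rating. It is an illustration under stated assumptions, not the FACETS estimation routine: the function name and argument names are hypothetical, the parameter values are supplied by the caller rather than estimated, and the lowest rating is taken to be 0.

```python
import math

def category_probabilities(b, d, c, t, steps):
    """Probabilities of ratings 0..m under the extended Rasch model.

    b: examinee ability, d: item difficulty, c: judge severity,
    t: grading period severity, steps: [F_1, ..., F_m] step difficulties,
    all in logits. Adjacent ratings satisfy
    log(P_x / P_(x-1)) = b - d - c - t - F_x.
    """
    log_numerators = [0.0]  # rating 0 is the reference category
    for f in steps:
        log_numerators.append(log_numerators[-1] + (b - d - c - t - f))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def expected_rating(probabilities):
    """Model-expected rating for one examinee, item, judge and period."""
    return sum(k * p for k, p in enumerate(probabilities))
```

Holding everything else fixed, raising the judge severity c lowers the expected rating, which is the raw score disadvantage that the calibration is meant to remove.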

The positioning of elements within each facet provides the frame of
reference for verifying the intended examination definition. Examinee
measures ($B_n$) are ordered from highest to lowest, judge severities ($C_j$) are
ordered from most to least severe, and any differences among grading periods
($T_t$) are observable. It is also possible to observe how the grading
categories ($F_x$) are used by the judges and the ordering of the examination
item difficulties. This study, however, focuses on differences in judge
severities and differences among grading periods. The other facets are
calibrated as part of the analysis, but will not be discussed.



Data from each of the three examinations were analyzed using FACETS
(Linacre, 1988), a computer program for Rasch analysis of examinations with
more than two facets. The FACETS program estimates objective and conjointly
additive (Luce & Tukey, 1964) calibrations, standard errors and fit statistics
for each element of each facet in the examination. The examinee raw scores
are linearized and corrected for variations in the measured severities of the
particular judges and grading periods encountered by an examinee. The
importance of this correction depends on the overlap among judges, items and
examinee performances. The more variable the combinations, the more important
the correction to obtain objectivity.

The fit statistics evaluate the suitability of the data for the 
construction of a variable and identify inconsistency for any element of any 
facet. Consistency verifies that these data are appropriate for making 
measures (Wright & Stone, 1979 chapter 4 and Wright & Masters, 1982 chapter 
5). The fit statistics for judges indicate the degree to which each judge is 
internally self-consistent (intra-judge consistency). Deviant judges can be 
flagged. Unexpected scores can be identified and their effect on examinee 
measures analyzed. The fit statistics for each grading period indicate the
inter-judge consistency among judges during that grading period.

Two kinds of fit to the expectations of the model are reported. The
infit statistic is an information-weighted mean-square residual which is
sensitive to an accumulation of central or inlying deviations. The outfit
statistic is an unweighted mean-square residual which is sensitive to
occasional outlying deviations. The expected value for the mean squares is
one (1.0) and their asymptotic standard errors are approximately the square
root of (2/d.f.) where d.f. is the number of independent replications on which
the corresponding estimate is based. The region of acceptable fit will be
mean squares greater than 0.5 and less than 1.5. Judges or grading periods
with infits or outfits beyond these criteria will be flagged and reviewed
carefully for unexpected deviations.
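
The two mean squares can be sketched as follows. This is a minimal illustration using the standard Rasch definitions, not code from FACETS; it assumes the model-expected rating and model variance for each observation are already available, and the function names are hypothetical.

```python
import math

def fit_mean_squares(observed, expected, variance):
    """Infit and outfit mean squares for one element (for example, one judge).

    observed: ratings awarded; expected: model-expected ratings;
    variance: model variance of each rating. Lists of equal length.
    """
    residuals = [o - e for o, e in zip(observed, expected)]
    # Outfit: unweighted mean of the squared standardized residuals.
    outfit = sum(r * r / v for r, v in zip(residuals, variance)) / len(residuals)
    # Infit: squared residuals weighted by information (the model variance).
    infit = sum(r * r for r in residuals) / sum(variance)
    return infit, outfit

def is_flagged(mean_square, low=0.5, high=1.5):
    """True when a mean square lies outside the region of acceptable fit."""
    return not (low < mean_square < high)

def mean_square_se(df):
    """Approximate asymptotic standard error of a mean square, sqrt(2 / d.f.)."""
    return math.sqrt(2.0 / df)
```

Values well below 1.0 indicate ratings that vary less than the model expects, and values well above 1.0 indicate unexpected ratings; both directions are flagged by the 0.5 and 1.5 bounds given in the text.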

The elements in each facet are summarized by their estimated mean,
standard deviation, reliability of element separation and corresponding
chi-square for homogeneity. In most test situations, variation in examinee
performance is expected. When all examinees take all items and all judges
grade all examinees, the variations in judge severities do not produce unfair
scores. But when judges are allocated to examinee performances and grading
periods vary, variation in judge severities and grading periods can affect raw
scores and should be accounted for before examinee measures are calculated.

Separation reliability (similar to the KR-20) is the proportion of the 
observed variance in item difficulties, examinee measures, judge severities 
and grading period estimates not due to measurement error (Wright & Masters, 
1982 pp 91-94). The chi-square for homogeneity tests whether the judges can
be regarded as sharing the same severity after allowing for measurement error. 
A significant chi-square indicates that the variation in judge severities 
exceeds the error of measurement. 
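
The summary statistics can be reproduced from a list of calibrations and their standard errors. The sketch below is an illustration under assumptions rather than the FACETS computation: the observed variance is taken over the elements themselves (dividing by n), which is the choice that matches the Table 1 summary, and the homogeneity chi-square uses the usual information-weighted form with n - 1 degrees of freedom.

```python
import math

def separation_summary(calibrations, errors):
    """Separation, separation reliability and fixed (all same) chi-square.

    calibrations: element calibrations in logits (for example, judge severities).
    errors: the standard error of each calibration.
    """
    n = len(calibrations)
    mean = sum(calibrations) / n
    observed_var = sum((c - mean) ** 2 for c in calibrations) / n
    mse = sum(e * e for e in errors) / n          # mean square error
    rmse = math.sqrt(mse)
    adj_sd = math.sqrt(max(observed_var - mse, 0.0))
    separation = adj_sd / rmse
    reliability = separation ** 2 / (1.0 + separation ** 2)
    # Homogeneity test: do all elements share one common calibration?
    weights = [1.0 / (e * e) for e in errors]
    weighted_sum = sum(w * c for w, c in zip(weights, calibrations))
    chi_square = (sum(w * c * c for w, c in zip(weights, calibrations))
                  - weighted_sum ** 2 / sum(weights))
    return {"rmse": rmse, "adj_sd": adj_sd, "separation": separation,
            "reliability": reliability, "chi_square": chi_square, "df": n - 1}
```

Applied to the twelve essay judge severities and the common error of .08 reported in Table 1, this gives a separation of about 2.16 and a reliability of about .82. The chi-square computed from the rounded table entries comes out near 68 rather than the reported 65.88, presumably because the published calibrations and errors are rounded.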

To determine the effect of grading period on examinee measures, each
examination was analyzed twice, first with grading period modelled as a facet
and again without grading period modelled as a facet. The second analysis
assumes that all grading periods are comparable. In the essay examination,
where there is complete overlap, grading period should have no effect on
examinee measures. For the clinical and oral examinations, in which judges
are allocated to examinees, grading period may have an observable effect.

Results

Tables 1, 2 and 3 show the judge severity calibrations in order of
severity for the essay, clinical and oral examinations and the summary
statistics. For all three examinations calibrated judge severities show a
range well beyond that expected due to error of measurement. The range of
judge severities for the essay exam is .45 to -.30 logits and separation
reliability is .82. For the clinical examination the range of judge
severities is 1.21 to -.97 logits and separation reliability is .95. For the
oral examination the range of judge severities is 1.67 to -1.58 logits and
separation reliability is .86. The chi-square analyses for all three
examinations confirmed that judge severities were significantly different
(p<.00).

The fit statistics show intra-judge consistency for most judges within
their level of severity. On the essay exam (Table 1) judge 4 is more
consistent than expected (.5 infit & outfit). Review of the data found that
this judge limited his use of the rating scale to points 5, 6 and 7 of the
nine points possible. Judge 12 verged on misfit (infit 1.4 and outfit 1.4).
Judge 12 awarded some ratings that were unexpectedly low (1 or 2 points) given
his overall grading pattern. The examinees who received the low scores from
this judge received relatively low scores from the other judges as well. All
judges graded all examinees on all essays, yet measurably different judge
severities are observable.

On the clinical examination (Table 2) no judge was sufficiently 
inconsistent to be outside the region of acceptable fit. Judges, however, 
manifest measurably different levels of severity. 

On the oral examination (Table 3), five judges, 10, 15, 19, 25 and 36, show
misfit. Review of their data revealed that these judges gave unexpectedly low
grades to some examinees. Judges 10 and 19 graded make-up examinees during
the third grading period. These judges gave lower than expected scores
(given their grading patterns) to these less able examinees, which show as
intra-judge inconsistencies. Judges 15, 25 and 36 each graded one examinee
lower than expected on one protocol, which caused the misfit. Again, judges
manifest measurably different levels of severity.

These analyses show that judges, regardless of the examination, can vary 
significantly in their severities but are generally consistent in their 
application of their level of severity across examinees. 

Table 4 shows the calibrations of the grading periods for the three
examinations. For the clinical examination the judges are more severe in the
second grading period but less severe in the fourth period. For the oral
examination the judges are consistent across the three grading periods. For
the essay examination the judges are more severe during the third grading
period. In all three examinations, the judges became less severe toward the
end of the session. The infit and outfit statistics show that the grading
period data fit the model. Both infit and outfit are within the acceptable
region, indicating inter-judge consistency within each grading period.

A chi-square analysis for homogeneity across grading periods found
significant differences for the clinical examination ($\chi^2$ = 61.33, df = 3,
p<.00) and the essay examination ($\chi^2$ = 17.90, df = 7, p<.00), showing that the
severity of grading can change significantly across grading periods. There
was not a significant difference across grading periods for the oral
examination.

The examinee measures for the essay examination with grading period
(time) calibrated and grading period (time) uncalibrated are presented in
Graph 1. There is a near perfect relationship between the measures earned
with and without time as a calibrated facet. This is because the examination
had complete overlap of judges, items and examinees. All examinees were
graded during all time periods, removing any advantage due to time.

The examinee measures for the clinical examination with time calibrated 
and time uncalibrated are plotted in Graph 2. Two distinct groups of 
examinees can be observed, those who are penalized and those who have an 
advantage due to the grading period. The examinee measures on line A were 
penalized due to the grading period, while those on line B had an advantage. 

The examinee measures for the oral examination with time calibrated and
uncalibrated are plotted in Graph 3. There is a slight advantage for some 
examinees due to grading period, although the effect is not marked because 
there is no significant difference among grading periods. Graph 3A, an 
enlargement of the measures around .00, shows that some examinees may have 
been penalized slightly due to grading period. The effect is not large, but 
it could have an impact on a few pass/fail decisions. 

Discussion 

These data demonstrate that judges differ in their severities regardless
of the examination. The fit statistics for all three examinations, however,
confirm that most judges are reasonably consistent in the application of their
individual level of severity. When it is possible for all judges to grade all
examinee performances across all grading periods, the unique effects of judge
and grading period are neutralized, as in the essay examination. When,
however, for reasons of money or time, it is necessary to allocate examinee
performances to subsets of judges, the effects of judge severity and grading
period become important. Correction for grading period and judge severity
improves the examinee measures because it frees them from the effects of the
particular judge and grading period encountered and makes them more objective.

The Rasch fit statistics flag deviant grading patterns so that they can 
be reviewed. Misfit focuses diagnostic study of the data and provides 
specific information which can be shared with the judge. Detailed information
about inconsistent grading can stimulate judges to think about their grading 
patterns and may lead to improved consistency. 

Short term effects such as fatigue and attitude may account for the 
changes across grading periods. One can imagine that at the beginning of a 
grading session, judges get "warmed up". After they get "warmed up" they 
grade seriously, perhaps more severely, for a while. But then, as the end of 
the session draws near they "ease up" a little. This is perhaps normal human 
behavior, but it may also penalize a subgroup of examinees. 

Training judges and developing detailed definitions of the scoring 
system and criteria help standardize the examination. After all reasonable
efforts have been made to train judges, differences in severities are still 
observable. A training session of 2 to 3 hours may not be able to change 
ingrained personal expectations. It may be more reasonable to compensate for 
differences among judges than to attempt to make them comparable. 

The use of the Rasch model places responsibility on the analyst. There
may be a danger that judge severities can be over- or under-calibrated, thus
making an unfair adjustment to an examinee measure. The misfit statistics
flag this possibility so the data can be reviewed. There is also the need to
create a sound linking network of items, judges and examinees. The FACETS
program calculates the error of measurement for each element calibration and
examinee measure. This quantifies the possible error associated with the use
of the calibration or measure for decision purposes.

Any subgroup of judges is unique, so examinees who happen to get more 
lenient judges have a raw score advantage over examinees who happen to get 
more severe judges. This inequity is well documented but has been ignored 
because reasonable tools for dealing with the problem were not available. The 
use of the extended Rasch model provides these missing tools. The whole 
process of dealing with examinations that require judges becomes less 
mystical, more quantitative and more understandable to both judges and 
psychometric experts. 

TABLE 1

Severity of Judges on Essay Exam in Order of Severity

                Judge              Count       Judge Severity            Infit   Outfit
                Number    Score    of Essays   (Logit)          Error    MnSq    MnSq
Most severe      1        296      96           0.45            0.08     0.8     0.8
                 3        337      96           0.18            0.08     1.0     1.0
                 6        338      96           0.17            0.08     1.0     1.0
                 5        348      96           0.10            0.08     0.7     0.7
                10        365      96          -0.01            0.08     0.7     0.7
                11        370      96          -0.03            0.08     1.2     1.1
                12        374      96          -0.06            0.08     1.4     1.4
                 2        377      96          -0.08            0.08     1.1     1.1
                 7        383      96          -0.12            0.08     1.1     [?]
                 9        388      96          -0.15            0.08     1.0     1.0
                 4        389      96          -0.15            0.08     0.5     0.5
Least severe     8        412      96          -0.30            0.08     1.3     1.2

Mean:                     364.8    96.0        -0.00            0.08     1.0     1.0
S.D.:                      29.5     0.0         0.19            0.00     0.2     0.2

[?] marks an entry that is illegible in the source copy.

Fixed (all same) chi-square: 65.88   d.f.: 11   significance: 0.00

RMSE = root mean square error of judge calibrations = .08
Adj S.D. = square root of (observed variance minus mean square error variance) = .17
Separation = (Adj S.D.)/RMSE = 2.16
Separation reliability = (Separation)^2 / (1 + (Separation)^2) = .82

TABLE 2

Severity of Judges on Clinical Examination in Order of Severity

                Judge              Count       Judge       Model    Infit   Outfit
                Number    Score    of Slides   Severity    Error    MnSq    MnSq
Most severe     [?]       [?]      [?]          1.21       0.19     0.8     0.8
                [?]       [?]      [?]          [?]        0.18     0.7     0.7
                [?]       [?]      [?]          0.70       0.07     0.9     0.9
                [?]       [?]      [?]          [?]        0.15     1.0     1.2
                [?]       [?]      [?]          [?]        0.07     1.2     1.1
                [?]       [?]      [?]          [?]        0.07     [?]     0.9
                [?]       [?]      [?]          [?]        0.14     1.2     1.1
                [?]       [?]      [?]          [?]        0.08     1.0     1.0
                16        976      750          0.02       0.07     1.0     0.9
                [?]       [?]      [?]          [?]        0.08     1.1     0.9
                 7        [?]      [?]          [?]        0.07     1.0     0.9
                 2        285      210         -0.39       0.15     1.1     1.0
                 4        985      705          [?]        0.09     1.0     [?]
                11       1333      930         -0.41       0.07     1.2     1.0
                18       1078      780         -0.41       0.08     1.2     1.2
                12        886      630         -0.54       0.09     1.0     1.2
                 5        814      570         -0.56       0.10     1.1     1.0
Least severe    17        127       90         -0.97       0.24     1.2     0.9

Count: 18
Mean:                     716.4    [?]          0.00       0.11     1.0     1.0
S.D.:                     422.0    310.5        0.56       0.05     0.2     0.1

[?] marks an entry that is illegible in the source copy.

RMSE 0.12   Adj S.D. 0.55   Separation 4.49   Reliability 0.95
Fixed (all same) chi-square: 382.04   d.f.: 17   significance: [?]
(see Table 1 for definitions)

TABLE 3

Severity of Judges on Oral Examination in Order of Severity

                                   Count of    Judge                Infit   Outfit
                Judge     Score    Protocols   Severity    Error    MnSq    MnSq
Most severe     33         56      39           1.67       0.25     1.3     1.3
                29         68      39           1.51       0.26     1.1     1.0
                23         87      39           1.40       0.32     0.6     0.5
                49         66      33           1.34       0.32     0.7     0.6
                39         75      36           1.32       0.30     1.4     1.4
                10         78      42           1.13       0.27     2.0     2.2
                26         75      36           1.08       0.30     0.9     0.9
                40         67      33           1.04       0.31     0.8     0.9
                 7         76      39           0.74       0.27     0.6     0.6
                43         38      24           0.69       0.37     0.7     1.2
                12         79      33           0.65       0.34     0.9     0.8
                18         75      36           0.64       0.30     0.6     0.6
                46         74      36           0.63       0.29     0.9     0.8
                 8         81      42           0.52       0.26     0.8     0.7
                31         77      42           0.48       0.26     0.6     0.5
                15         89      39           0.46       0.30     1.6     1.6
                16         62      33           0.45       0.30     0.6     0.6
                47         88      42           0.39       0.28     1.0     1.0
                 3         66      33           0.36       0.31     1.2     1.0
                44         82      40           0.22       0.28     1.0     0.9
                34         86      45          -0.02       0.25     1.0     1.0
                30         77      36          -0.14       0.31     1.1     1.1
                45         99      45          -0.20       0.27     1.0     0.8
                48         81      39          -0.23       0.29     1.0     0.9
                35         68      33          -0.25       0.31     1.0     1.1
                37         75      36          -0.28       0.31     1.0     1.0
                50         78      33          -0.32       0.34     1.1     1.2
                42         82      36          -0.34       0.33     0.8     0.7
                11         72      37          -0.42       0.29     0.7     0.8
                19         88      42          -0.46       0.27     1.5     1.6
                32         79      33          -0.51       0.34     0.8     0.8
                41         66      33          -0.54       0.31     1.2     1.1
                 2         73      30          -0.55       0.37     1.2     1.3
                 1        102      48          -0.68       0.26     0.7     [?]
                14         68      33          -0.68       0.31     1.1     1.1
                51         78      36          -0.75       0.31     0.6     0.6
                13         96      42          -0.77       0.29     1.0     0.9
                 6         77      36          -0.83       0.31     0.7     0.8
                20         68      30          -0.83       0.35     1.2     1.1
                28         73      33          -0.85       0.33     0.9     0.9
                36         68      30          -0.91       0.36     1.4     1.6
                52         70      30          -1.10       0.36     0.7     0.7
                21         79      36          -1.11       0.30     0.7     0.7
                53         63      27          -1.17       0.38     0.7     0.6
                25         81      36          -1.20       0.33     1.6     1.7
Least severe    17        [?]      36          -1.58       0.38     1.3     1.1

Mean:                      76.0    36.2         0.00       0.31     1.0     1.0
S.D.:                      11.1     4.8         0.83       0.03     0.3     0.3

[?] marks an entry that is illegible in the source copy.

RMSE 0.31   Adj S.D. 0.78   Separation 2.51   Reliability 0.86
Fixed (all same) chi-square: 345.12   d.f.: 45   significance: 0.00
(see Table 1 for definitions)

TABLE 4

Grading Severity Calibrations Across Time Periods
For Clinical, Oral, Essay Examinations

                                      Grading                  Consistency
                                      Severity                 Infit   Outfit
Examination   Period   Time           Calibrations*    SE      MnSq    MnSq
Clinical      1        morning        -.03             .04     [?]     1.1
              2        afternoon       .30**           .04     1.0      .9
              3        morning        -.05             .03     1.0     1.0
              4        afternoon      -.22**           .06     1.0      .9
Oral          1        morning         .06             .06      .9      .9
              2        afternoon      -.05             .07     1.1     1.1
              3        morning        -.01             .17     1.2     1.1
Essay         1        morning         .05             .07     1.1     1.1
              2        afternoon       .05             .07     1.1     1.1
              3        morning         .20**           .07      .9      .9
              4        afternoon       .02             .07     1.1     1.1
              5        morning        -.02             .07     1.1     1.1
              6        afternoon      -.11             .07     1.0     1.0
              7        morning        -.11             .07      .8      .8
              8        afternoon      -.08             .07     1.1     1.1

*  Positive calibration = more severe grading;
   negative calibration = more lenient grading

** Statistically significant difference, chi-square analysis

[?] marks an entry that is illegible in the source copy.

GRAPH 1. ESSAY EXAMINATION MEASURES: TIME CALIBRATED VS. TIME UNCALIBRATED (axis label: Time Calibrated)

GRAPH 2. CLINICAL EXAM MEASURES: TIME CALIBRATED VS. TIME UNCALIBRATED (axis label: Time Calibrated)

GRAPH 3. ORAL EXAMINATION MEASURES: TIME CALIBRATED VS. TIME UNCALIBRATED

GRAPH 3A. ORAL EXAMINATION MEASURES: TIME CALIBRATED VS. TIME UNCALIBRATED

APPENDIX 1

$\log(P_{nijtk}/P_{nijt(k-1)}) = B_n - D_i - C_j - T_t - F_k$

where

$P_{nijtk}$ = probability of person n being given grade k by judge j on item i at time t
$P_{nijt(k-1)}$ = probability of person n being given grade k-1 by judge j on item i at time t
$B_n$ = ability of person n
$D_i$ = difficulty of item i
$C_j$ = severity of judge j
$T_t$ = stringency of time t
$F_k$ = difficulty of grading step k relative to step k-1

The probability of a performance ($B_n$) earning a particular measure depends
upon the rating (k) awarded and the additive effects of the difficulty of the
item ($D_i$), the severity of the judge ($C_j$), the grading period ($T_t$) and the
difficulty of the grading step ($F_k$). Misfit statistics identify the
particular gradings which are improbable and provide a check on the technical
validity of the measures. This study focuses on judge severity and grading
period; however, the other facets are also included in the equation to produce
more precise estimates.